Abstract

A Collective Intelligence (COIN) is a community of interacting re- 
inforcement learning (RL) algorithms designed so that their col- 
lective behavior maximizes a global utility function. We introduce 
the theory of COINs, then present experiments using that theory 
to design COINs to control internet traffic routing. These experiments
indicate that COINs outperform previously investigated RL-based systems
for such routing.


1 INTRODUCTION 

Collective INtelligences (COINs) are (usually) very large, sparsely connected re- 
current neural nets, whose “neurons” are themselves reinforcement learning (RL) 
algorithms. The distinguishing feature of a COIN is that its dynamics involves no 
centralized control, but only the collective effects of the individual neurons each 
modifying their behavior via their (local) RL algorithms. This restriction holds 
even though the goal of the COIN concerns the system’s global behavior. 

One naturally occurring COIN is a human economy, where the “neurons” consist of
individual humans trying to maximize their reward, and the “goal”, for example,
can be viewed as having the overall system achieve high gross domestic product.
The robustness and power of naturally occurring COINs suggest it would be useful
to construct Artificial COINs (ACOINs), for example as controllers of distributed
systems. This paper presents a preliminary investigation of this idea for the problem
of distributed routing of internet traffic. See [10] for work on other applications of
COINs.

In designing an ACOIN, we start with a global utility function concerning global 
behavior. Our task is to initialize and then update the neurons’ utility functions so 
that as the neurons improve their utilities, global utility also improves. (We may 
also wish to update the local topology of the net to the same end.) In particular, 
we need to ensure that the neurons do not “frustrate” each other as they attempt 
to increase their utilities. All of this must be done without centralized control. 

We refer to the RL algorithms at each neuron as microlearning, and to the updating
of the ACOIN as macrolearning. For robustness and breadth of applicability,
we assume essentially no knowledge concerning the dynamics of the full
system, i.e., the macrolearning and/or microlearning must “learn” that dynamics, 
implicitly or otherwise. This rules out any approach that models the full system. 

The problem of designing an ACOIN has never previously been addressed in full — 
hence the need for introducing new concepts like those described below. Nonethe- 
less, this problem is related to previous work in many fields: distributed artificial 
intelligence, multi-agent systems, computational ecologies, adaptive control,
computational economics, game theory (and in particular stochastic game theory,
evolutionary game theory, and game-theoretic mechanism design), voting and auction
theory, computational markets, iterated function systems, active walker models, 
self-organized criticality, spin glasses, reinforcement learning, recurrent neural nets, 
Markov decision process theory, network theory, and ant-based optimization. 

For the particular problem of routing, the most relevant previous work is that 
of [3, 8]. Like that previous work, we identify as a separate “neuron” each pair of 
a router and an ultimate destination in the net. That previous work focusses on 
using RL to update the internal parameters of neurons implementing conventional 
shortest path algorithms (SPAs), either using centralized control or (more usually) 
ignoring the problem of neurons working at cross purposes. It does not try to design 
a full-fledged COIN. 

We compared three algorithms for internet routing using numerical simulations. The
first two are an SPA and an ACOIN with perfect knowledge, in which each neuron
knows i) the quickest possible path to the ultimate destination, for the SPA; or
ii) the reward-signal-maximizing path, for the ACOIN. The third algorithm was an
ACOIN using a memory-based microlearner [1] having limited knowledge.

The performance of the perfect-knowledge ACOIN was the theoretical optimum. 
The performance of the perfect-knowledge SPA was PERCENT of this optimum. 
Despite limited knowledge, the memory-based ACOIN outperformed the perfect-knowledge
SPA, reducing the deficit in that algorithm’s performance by PERCENT.
Note that the performance of the perfect-knowledge SPA is an upper bound on the
performance of any RL-based SPA. 

The next section presents a cursory overview of the mathematics behind ACOINs. 
The following section presents computer experiments concerning that mathematics, 
experiments that establish its power in the context of routing problems. The final
section presents conclusions and summarizes future research directions.


2 MATHEMATICS OF COINS 


The mathematical framework for COINs in general is quite extensive [9]. This paper 
concentrates on four of the concepts from that framework: subworlds, factored 
systems, constraint-alignment, and the wonderful-life utility function. 

i) We consider the state of the system across a set of $T + 1$ consecutive timesteps,
which we write without loss of generality as $t \in [0, T]$. Let $\zeta_{\eta,t}$ be the
(Euclidean-vector-valued) state of neuron $\eta$ at time $t$. World utility is a function
of the state of all neurons across all $t$, which we write as $G(\zeta)$.

ii) To simplify the design of the ACOIN, we group neurons into subworlds, with all
neurons in a subworld sharing the same subworld utility. In the ACOINs considered
here, subworlds have no overlap, and a neuron’s microlearner receives the utility of
that neuron’s subworld as its reward signal. We indicate the state of neuron $\eta$ in
subworld $\omega$ at time $t$ by $\zeta_{\omega,\eta,t}$, and by $g_\omega(\zeta)$ we indicate the
subworld utility of subworld $\omega$.

iii) One of the primary desiderata in designing an ACOIN is to have it be
subworld-factored. This means that a change at time 0 to the state of the neurons in
subworld $\omega$ results in an increased value for $g_\omega(\zeta)$ iff it results in an
increased value for $G(\zeta)$.

For a subworld-factored system, $\omega$’s increasing its own utility does not have
“side-effects” on the rest of the system that decrease world utility. For these systems, the
separate neurons pursuing their separate goals do not frustrate each other, as far as 
world utility is concerned. In particular, consider a Nash equilibrium of the system, 
i.e., a state in which no neuron can increase its utility by changing its action, given 
the actions of the other neurons [6, 7]. The following is proven in [us]: 

Theorem 1: If a system is subworld-factored, then any state at time 0 of the system
that constitutes a Nash equilibrium for the subworld utilities is a critical point of
world utility. Conversely, a maximum of world utility is such a Nash equilibrium. 

In other words, for a subworld-factored system, although it may be maximized at 
other points as well, world utility is assuredly maximized when each of the neurons’ 
reinforcement learning algorithms is performing as well as it can, given the
others’ behavior. So we have the correct behavior if and when the learners reach
an equilibrium. There can be no tragedy of the commons [4], etc. 

iv) A (perfectly) constraint-aligned system is one in which any change to the state
of the neurons in subworld $\omega$ at time 0 will have no effects on the states of neurons
outside of $\omega$ at times later than 0.

v) Let $\mathrm{Zero}_\omega(\zeta)$ be defined as the vector $\zeta$ modified by setting the
values of all neurons in subworld $\omega$ across all time to 0. The wonderful life
subworld utility (WLU) is $g_\omega(\zeta) \equiv G(\zeta) - G(\mathrm{Zero}_\omega(\zeta))$.

Intuitively, the WLU is analogous to the change in world utility that would have
arisen if subworld $\omega$ had never existed. (Hence its name - cf. the Frank Capra
movie.) Note though that no assumption is made that $\mathrm{Zero}_\omega(\zeta)$ is consistent
with the dynamics of the system. In particular, to evaluate the WLU we do not have
to infer how the system would have evolved if all neurons in $\omega$ were set to 0 at
time 0. So long as we know $\zeta$ (by observation, in the process of macrolearning that
occurs at time $T$), and so long as we know $G$ (set in the problem specification of the
ACOIN), we can evaluate the WLU. This is true even if we know nothing of the dynamics
of the system. This dynamics-independence is a major strength of the WLU.
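The dynamics-independence described above can be illustrated with a minimal sketch. It assumes scalar neuron states (the text allows vector-valued states), an observed worldline stored as a timesteps-by-neurons array, and a toy world utility; the names `zero_subworld`, `wonderful_life_utility`, and the particular `G` are hypothetical and are not taken from the paper. Note that the code never simulates how the system would have evolved; it only clamps the observed values.

```python
import numpy as np

def zero_subworld(zeta, subworld_mask):
    """Return a copy of zeta with every neuron in the subworld set to 0 at all times."""
    clamped = zeta.copy()
    clamped[:, subworld_mask] = 0.0
    return clamped

def wonderful_life_utility(G, zeta, subworld_mask):
    """g_omega(zeta) = G(zeta) - G(Zero_omega(zeta)); needs only zeta and G."""
    return G(zeta) - G(zero_subworld(zeta, subworld_mask))

# Observed worldline: 3 timesteps (rows) by 4 scalar-state neurons (columns).
zeta = np.array([[1.0, 2.0, 0.5, 1.5],
                 [1.0, 1.0, 1.0, 1.0],
                 [2.0, 0.0, 0.5, 0.5]])

def G(z):
    """Hypothetical world utility: minus the squared total activity, summed over time."""
    return -np.sum(z.sum(axis=1) ** 2)

# Assumption for illustration: neurons 0 and 1 form subworld omega.
omega = np.array([True, True, False, False])
g_omega = wonderful_life_utility(G, zeta, omega)
```

Because the clamped worldline is built directly from the observed one, no model of the system dynamics appears anywhere in the computation.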

In addition to assuring the correct equilibrium behavior, there exist many other
theoretical advantages to having a system be subworld-factored [us]. In this paper
we use computer experiments to establish the applicability of these advantages to 
network routing. In particular, the following theorem, proved in [us], serves as the 
basis for the investigation of this paper: 

Theorem 2: A constraint-aligned system with wonderful life subworld utilities is
subworld-factored.
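The intuition behind Theorem 2 can be sketched in a few lines. This is only an informal reading of the definitions given above, not the proof from [us]; the primed symbol $\zeta'$ is introduced here for illustration and denotes the worldline that results from a change at time 0 to the neurons in subworld $\omega$.

```latex
\begin{aligned}
&\text{Constraint alignment: } \zeta'_{\eta,t} = \zeta_{\eta,t}
   \ \text{ for all } \eta \notin \omega \text{ and all } t
   \;\Longrightarrow\; \mathrm{Zero}_\omega(\zeta') = \mathrm{Zero}_\omega(\zeta), \\
&g_\omega(\zeta') - g_\omega(\zeta)
   = \bigl[ G(\zeta') - G(\mathrm{Zero}_\omega(\zeta')) \bigr]
   - \bigl[ G(\zeta) - G(\mathrm{Zero}_\omega(\zeta)) \bigr]
   = G(\zeta') - G(\zeta).
\end{aligned}
```

Under these assumptions the change in subworld utility equals the change in world utility, so one increases iff the other does, which is the subworld-factored condition.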


3 EXPERIMENTS 

3.1 Network Architectures and mapping to COINs 

In our experiments we concentrated on the two network architectures displayed 
in Figure 1 (note that both are slightly larger than those in [5]). To facilitate 
the analysis, traffic originated only at the routers marked (SRC), and had only the
routers marked (DEST) as ultimate destinations. Although each router
can in principle route traffic, only those that are neither originators nor ultimate
destinations were designated as “router” in Figure 1. Note that in both networks
there is a bottleneck in the central router.



Figure 1: Network Architectures. (a) Network A. (b) Network B.

As is standard in much of traffic network analysis [2], at any time, all traffic at
a router is a real-valued number together with an ultimate destination tag. At
each timestep, each router sums all traffic received in this timestep, to get a total
load. A running average is kept of each router’s load over a window of the previous
$L$ timesteps, and that average is run through a load-to-delay function to get the
router’s total delay at this timestep. To ensure a heterogeneous pool of routers,
different routers had different load-to-delay functions. This situation more accurately
reflects a real network where differences in router software and hardware (response
time, queue length, processing speed, etc.) affect the resulting delays. In the
experiments reported here, the load-to-delay functions were $W(x) = x^3$ for routers 1 and
3, and $W(x) = \log(x + 1)$ for router 2, for both networks.
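The load and delay computation described above can be sketched as follows. This is a minimal illustration, not the simulator used in the experiments: the `Router` class and its method names are hypothetical, the logarithm is assumed to be natural, and the running average is assumed to be taken over however many timesteps are available until the window fills.

```python
import math
from collections import deque

class Router:
    """Router whose delay is a function of its load averaged over the last L timesteps."""

    def __init__(self, load_to_delay, window):
        self.load_to_delay = load_to_delay
        self.loads = deque(maxlen=window)  # keeps only the most recent `window` loads

    def step(self, incoming_traffic):
        """Sum this timestep's traffic, update the running average, return the delay."""
        self.loads.append(sum(incoming_traffic))
        average_load = sum(self.loads) / len(self.loads)
        return self.load_to_delay(average_load)

def cubic(load):
    """W(x) = x^3, the function given in the text for routers 1 and 3."""
    return load ** 3

def logarithmic(load):
    """W(x) = log(x + 1), the function given in the text for router 2 (natural log assumed)."""
    return math.log(load + 1)

router1 = Router(cubic, window=2)
router2 = Router(logarithmic, window=2)

# Three timesteps of incoming traffic amounts (arbitrary illustrative values).
traffic_by_step = ([1.0, 1.0], [0.0], [3.0, 1.0])
delays1 = [router1.step(traffic) for traffic in traffic_by_step]
delays2 = [router2.step(traffic) for traffic in traffic_by_step]
```

With a window of 2, the third delay depends on the loads of the second and third timesteps only, which shows how a larger window smooths the response to bursts.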

In terms of the COIN formalism, we identified the neurons $\eta$ as individual pairs of
routers and ultimate destinations. So $\zeta_{\eta,t}$ was the vector of traffic sent along all
links exiting $\eta$’s router, tagged for $\eta$’s ultimate destination, at time $t$. Each subworld
consisted of the set of all neurons that shared a particular ultimate destination.

An SPA tries to set $\zeta_{\eta,t}$ to minimize the sum of the total delays along all possible
paths to $\eta$’s ultimate destination. Note the greedy nature of this scheme. An
ACOIN instead tries to set $\zeta_{\eta,t}$ to maximize $g_\omega$ for the subworld $\omega$ containing $\eta$.

For both algorithms, “perfect knowledge” means that at time t, all of the routers 
know the values of $\zeta$ for time $t - 1$, and assume that those values will be the same at
t. For large enough window sizes L, this assumption will be arbitrarily good in most 
cases, and therefore will allow the routers to make arbitrarily accurate estimates of 
how best to route their traffic, according to their respective routing criteria. 

The load at router $r$ at time $t$ is determined by $\zeta$. Accordingly, we can encapsulate
the load-to-delay functions at the nodes by writing the delay at node $r$ at time
$t$ as $W_{r,t}(\zeta)$. In our experiments world utility was the summed total delay, i.e.,
$G(\zeta) = \sum_{r,t} W_{r,t}(\zeta)$. So using the WLU,
$g_\omega(\zeta) = \sum_{r,t} [W_{r,t}(\zeta) - W_{r,t}(\mathrm{Zero}_\omega(\zeta))]$.
Now $W_{r,t}(\mathrm{Zero}_\omega(\zeta))$ is the delay that would have existed at router $r$ at time $t$ if all
the traffic from $r$ at $t$ that was destined for $\omega$ had not existed. So for our $G$, the
WLU is just a sum of differences in delays “caused” by the routing of the neurons
in subworld $\omega$.
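As a worked instance of the sum of delay differences just described, the sketch below evaluates the WLU of one destination's subworld for a single timestep. It simplifies the setting: the running-average window is ignored, the traffic values are invented for illustration, and the function names are hypothetical. Only the two load-to-delay functions come from the text.

```python
import numpy as np

# traffic[r, d]: traffic arriving at router r tagged for destination d (one timestep).
traffic = np.array([[1.0, 0.5],
                    [2.0, 1.0],
                    [0.5, 0.5]])

# Load-to-delay functions from the text: x^3 for routers 1 and 3, log(x + 1) for router 2.
load_to_delay = [lambda x: x ** 3, lambda x: np.log(x + 1.0), lambda x: x ** 3]

def router_delays(traffic):
    """Delay at each router, computed from its total load over all destination tags."""
    loads = traffic.sum(axis=1)
    return np.array([w(load) for w, load in zip(load_to_delay, loads)])

def world_utility(traffic):
    """G: summed total delay over routers (a single timestep in this toy example)."""
    return router_delays(traffic).sum()

def wlu(traffic, destination):
    """Sum over routers of the delay with, minus the delay without, this subworld's traffic."""
    without = traffic.copy()
    without[:, destination] = 0.0  # Zero_omega: remove all traffic tagged for this destination
    return (router_delays(traffic) - router_delays(without)).sum()

g_destination_0 = wlu(traffic, destination=0)
```

Each term of the sum is local to one router, which is what makes it possible to accumulate the utility along the path a packet takes.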

The evaluation of this subworld utility in the memory-based-reasoning COIN was
trivial to implement. All traffic had a header containing a running sum of the
differences in total delays encountered in all the routers it had traversed so far.
Then each ultimate destination summed all such headers it received and echoed 
that sum back to all routers that had routed to it. 
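The header mechanism can be sketched as below. This is a hedged illustration only: the `Packet` fields, the function names, and the way each router obtains its delay with and without the subworld's traffic are assumptions, since the text does not specify those details.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    destination: str
    delay_difference_sum: float = 0.0            # header: running sum of delay differences
    visited: list = field(default_factory=list)  # routers traversed so far

def traverse(packet, router, delay_with, delay_without):
    """Add this router's delay difference to the header and record the visit."""
    packet.delay_difference_sum += delay_with - delay_without
    packet.visited.append(router)

def echo_subworld_utility(received_packets):
    """At a destination: sum all headers and report that sum to every router that routed here."""
    total = sum(p.delay_difference_sum for p in received_packets)
    routers = {r for p in received_packets for r in p.visited}
    return {router: total for router in routers}

# Two packets bound for the same (hypothetical) destination "D1".
p1 = Packet("D1")
traverse(p1, "R1", delay_with=3.0, delay_without=1.0)
traverse(p1, "R2", delay_with=2.0, delay_without=1.5)
p2 = Packet("D1")
traverse(p2, "R3", delay_with=4.0, delay_without=3.0)

feedback = echo_subworld_utility([p1, p2])
```

Every router that contributed traffic receives the same number, matching the statement that all neurons in a subworld share one subworld utility.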

In this way each neuron is apprised of the subworld utility of its subworld that
resulted (in part) from its routing decisions. The memory-based microlearner worked
by forming a training set in which load distribution decisions were mapped to
wonderful life utilities. More precisely, for each router-ultimate destination
pair, the training patterns had the loads on each outgoing link as inputs, and the
resulting wonderful life utilities as outputs. Then the wonderful life utility
corresponding to each routing decision was estimated by using a nearest neighbor
algorithm, and the router sent the packets along the path that would improve its
wonderful life utility.
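A minimal sketch of such a memory-based microlearner follows. It assumes a one-nearest-neighbor estimate with Euclidean distance and a sign convention in which a higher stored utility is better (for example, a negated delay sum); the class and method names are hypothetical, and the training values are invented for illustration.

```python
import numpy as np

class MemoryBasedMicrolearner:
    """Stores (outgoing-link loads -> observed utility) pairs; estimates by nearest neighbor."""

    def __init__(self):
        self.inputs = []
        self.utilities = []

    def remember(self, link_loads, utility):
        """Add one training pattern: loads on each outgoing link and the resulting utility."""
        self.inputs.append(np.asarray(link_loads, dtype=float))
        self.utilities.append(float(utility))

    def estimate(self, link_loads):
        """Utility of the stored pattern closest (Euclidean distance) to link_loads."""
        query = np.asarray(link_loads, dtype=float)
        distances = np.linalg.norm(np.array(self.inputs) - query, axis=1)
        return self.utilities[int(np.argmin(distances))]

    def choose(self, candidates):
        """Pick the candidate load split with the highest estimated utility."""
        return max(candidates, key=self.estimate)

learner = MemoryBasedMicrolearner()
learner.remember([1.0, 0.0], -1.0)  # all traffic on link 0
learner.remember([0.5, 0.5], -0.4)  # even split
learner.remember([0.0, 1.0], -0.7)  # all traffic on link 1

choice = learner.choose([(0.9, 0.1), (0.6, 0.4), (0.1, 0.9)])
```

In this toy training set the even split was observed to be best, so the candidate nearest to it is selected.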

3.2 Simulation Results 

The networks discussed above were tested under light, medium and heavy traffic. For
simulation purposes traffic density corresponded to the number of source-destination 
pairs that are active at any one time. For heavy traffic, all sources were sending 
packets to all destinations at each time step; for light traffic, each source only sent 
packets to half the destinations; and for medium traffic, on the average each source 
sent packets to 75% of the destinations. 

Table 1 reports the average traversal times per individual packet in each of the 
traffic regimes, for the shortest path algorithm with perfect knowledge, the ACOIN 
with perfect knowledge and the ACOIN with memory-based learning. Each entry is 
based on 50 runs and the differences are all statistically significant, including
SPA-PI vs. ACOIN-RL for Network A under light traffic conditions (i.e., the hypothesis
that the two means are equal is rejected at the $\alpha = 0.01$ level for the $t$-test).

Table 1: Total per Packet Transit Times

Network   Load      SPA-PI        ACOIN-PI      ACOIN-RL
A         light     0.53 ± .05    0.45 ± .01    0.50 ± .06
A         medium    1.26 ± .07    1.10 ± .01    1.21 ± .06
A         heavy     2.17 ± .08    1.93 ± .01    2.06 ± .07
B         light     2.13 ± .08    1.92 ± .01    2.05 ± .08
B         medium    4.37 ± .10    3.96 ± .01    4.19 ± .08
B         heavy     6.94 ± .11    6.35 ± .01    6.82 ± .17

Table 1 provides two important observations. First, the wonderful life utility based
ACOIN, by outperforming the SPA when both have perfect knowledge, demonstrates
the superiority of the new routing mechanism. By not greedily pursuing the quickest
reward (shortest path), the ACOIN settles into a more desirable state that reduces
the average traversal times for all packets. Second, even when the wonderful life
utility is estimated through a memory-based learner (through information available
at the local routers), the performance of the ACOIN still surpasses that of the
perfect-knowledge shortest path algorithm. This result not only establishes the
feasibility of ACOIN-based routers, but also demonstrates that for this task ACOINs
will outperform any algorithm that attempts to estimate the path information
through reinforcement/memory-based learning. This result follows from the
observation that the perfect-information shortest path algorithm is necessarily a ceiling
on any algorithm based on routing tables, since all the delays that otherwise would
have to be estimated are readily available.
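One way to quantify the second observation is to ask what fraction of the gap between SPA-PI and ACOIN-PI is recovered by ACOIN-RL in each row of Table 1. The sketch below does that arithmetic on the reported means. The ratio used here is an assumption about how such a comparison might be made, not a measure defined in the paper, and the function name `deficit_closed` is hypothetical.

```python
# Mean per-packet transit times copied from Table 1: (SPA-PI, ACOIN-PI, ACOIN-RL).
table1 = {
    ("A", "light"):  (0.53, 0.45, 0.50),
    ("A", "medium"): (1.26, 1.10, 1.21),
    ("A", "heavy"):  (2.17, 1.93, 2.06),
    ("B", "light"):  (2.13, 1.92, 2.05),
    ("B", "medium"): (4.37, 3.96, 4.19),
    ("B", "heavy"):  (6.94, 6.35, 6.82),
}

def deficit_closed(spa_pi, acoin_pi, acoin_rl):
    """Fraction of the SPA-PI to ACOIN-PI gap recovered by ACOIN-RL."""
    return (spa_pi - acoin_rl) / (spa_pi - acoin_pi)

for (network, load), means in table1.items():
    print(f"Network {network}, {load}: {deficit_closed(*means):.0%} of the gap closed")
```

The ratio uses the means only and ignores the reported spreads, so it should be read as a rough summary rather than a significance statement.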
Figure 2: Total Traversal Times. (a) Network A. (b) Network B.


4 DISCUSSION 